
    ZStream: A cost-based query processor for adaptively detecting composite events

    Composite (or Complex) event processing (CEP) systems search sequences of incoming events for occurrences of user-specified event patterns. Recently, they have gained more attention in a variety of areas due to their powerful and expressive query language and performance potential. Sequentiality (temporal ordering) is the primary way in which CEP systems relate events to each other. In this paper, we present a CEP system called ZStream to efficiently process such sequential patterns. Besides simple sequential patterns, ZStream is also able to detect other patterns, including conjunction, disjunction, negation, and Kleene closure. Unlike most recently proposed CEP systems, which use non-deterministic finite automata (NFAs) to detect patterns, ZStream uses tree-based query plans for both the logical and physical representation of query patterns. By carefully designing the underlying infrastructure and algorithms, ZStream is able to unify the evaluation of sequence, conjunction, disjunction, negation, and Kleene closure as variants of the join operator. Under this framework, a single pattern in ZStream may have several equivalent physical tree plans with different evaluation costs. We propose a cost model to estimate the computation costs of a plan. We show that our cost model can accurately capture the actual runtime behavior of a plan, and that choosing the optimal plan can result in a factor of four or more speedup versus an NFA-based approach. Based on this cost model and using a simple set of statistics about operator selectivity and data rates, ZStream is able to adaptively and seamlessly adjust the order in which it detects patterns on the fly. Finally, we describe a dynamic programming algorithm used in our cost model to efficiently search for an optimal query plan for a given pattern.
    National Science Foundation (U.S.) (Grant NETS-NOSS 0520032)
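
    The abstract does not give the cost model's exact formulas, so the following is a minimal sketch, under assumed inputs, of how a dynamic program can pick the cheapest tree plan for a sequential pattern: because sequence semantics only combine contiguous sub-patterns, the search ranges over interval split points, much like matrix-chain ordering. The per-operand rates, the flat join selectivity, and the cost formula (event pairs examined per unit time) are all illustrative assumptions, not ZStream's actual model.

        from functools import lru_cache

        RATES = [50.0, 10.0, 80.0, 5.0]  # events/sec for operands A, B, C, D (made up)
        SEL = 0.1                        # flat pairwise join selectivity (an assumption)

        @lru_cache(maxsize=None)
        def best_plan(i, j):
            """Cheapest tree plan covering operands i..j of a sequence pattern.
            Returns (total_cost, output_rate, plan_text)."""
            if i == j:
                return (0.0, RATES[i], chr(ord("A") + i))
            best = None
            for k in range(i, j):  # try every contiguous split point
                lcost, lrate, lplan = best_plan(i, k)
                rcost, rrate, rplan = best_plan(k + 1, j)
                join_cost = lrate * rrate        # event pairs examined per second
                out_rate = lrate * rrate * SEL   # matches produced per second
                total = lcost + rcost + join_cost
                if best is None or total < best[0]:
                    best = (total, out_rate, f"({lplan} JOIN {rplan})")
            return best

        cost, rate, plan = best_plan(0, len(RATES) - 1)
        print(f"plan={plan} cost={cost:.1f}/s output={rate:.3f}/s")

    Like matrix-chain ordering, this interval DP runs in O(n^3) for n operands, which is why an exhaustive search over tree plans stays cheap enough to redo adaptively as the statistics drift.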

    GRAPHiQL: A graph intuitive query language for relational databases

    Graph analytics is becoming increasingly popular, driving many important business applications from social network analysis to machine learning. Since most graph data is collected in a relational database, it seems natural to attempt to perform graph analytics within the relational environment. However, SQL, the query language for relational databases, makes it difficult to express graph analytics operations. This is because SQL requires programmers to think in terms of tables and joins, rather than the more natural representation of graphs as collections of nodes and edges. As a result, even relatively simple graph operations can require very complex SQL queries. In this paper, we present GRAPHiQL, an intuitive query language for graph analytics, which allows developers to reason in terms of nodes and edges. GRAPHiQL provides key graph constructs such as looping, recursion, and neighborhood operations. At runtime, GRAPHiQL compiles graph programs into efficient SQL queries that can run on any relational database. We demonstrate the applicability of GRAPHiQL on several applications and compare the performance of GRAPHiQL queries with those of Apache Giraph (a popular "vertex-centric" graph programming framework).
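
    The abstract does not show GRAPHiQL's concrete syntax, so the sketch below only illustrates the lowering it describes: a neighborhood operation phrased over nodes and edges becomes a self-join on a relational edges table. The schema, data, and the two-hop query are invented for illustration, with Python's sqlite3 standing in for an arbitrary relational backend.

        import sqlite3

        con = sqlite3.connect(":memory:")
        con.executescript("""
            CREATE TABLE edges (src INTEGER, dst INTEGER);
            INSERT INTO edges VALUES (1, 2), (2, 3), (2, 4), (3, 4);
        """)

        def two_hop_neighbors(node):
            """neighbors(neighbors(node)) in a graph DSL becomes a self-join:
            the kind of SQL a compiler like GRAPHiQL's could emit."""
            sql = """
                SELECT DISTINCT e2.dst
                FROM edges e1 JOIN edges e2 ON e1.dst = e2.src
                WHERE e1.src = ? AND e2.dst != ?
            """
            return [dst for (dst,) in con.execute(sql, (node, node))]

        print(two_hop_neighbors(1))  # -> [3, 4]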

    MDCC: Multi-Data Center Consistency

    Replicating data across multiple data centers not only allows moving the data closer to the user, thus reducing latency for applications, but also increases availability in the event of a data center failure. Therefore, it is not surprising that companies like Google, Yahoo, and Netflix already replicate user data across geographically different regions. However, replication across data centers is expensive. Inter-data center network delays are in the hundreds of milliseconds and vary significantly. Synchronous wide-area replication is therefore considered to be infeasible with strong consistency, and current solutions either settle for asynchronous replication, which implies the risk of losing data in the event of failures, restrict consistency to small partitions, or give up consistency entirely. With MDCC (Multi-Data Center Consistency), we describe the first optimistic commit protocol that does not require a master or partitioning and is strongly consistent at a cost similar to eventually consistent protocols. MDCC can commit transactions in a single round-trip across data centers in the normal operational case. We further propose a new programming model which empowers the application developer to handle longer and unpredictable latencies caused by inter-data center communication. Our evaluation using the TPC-W benchmark with MDCC deployed across 5 geographically diverse data centers shows that MDCC is able to achieve throughput and latency similar to eventually consistent quorum protocols, and that MDCC is able to sustain a data center outage without a significant impact on response times while guaranteeing strong consistency.
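
    The abstract gives only the protocol's headline property, so the following is a heavily simplified sketch of the happy path alone: a transaction commits if a fast quorum of data centers accepts all of its record writes in one message round, with acceptance guarded by an optimistic base-version check. The ceil(3n/4) quorum size is borrowed from classic Fast Paxos; the version rule and data structures are illustrative, not MDCC's actual implementation, and conflict resolution and recovery are omitted entirely.

        from dataclasses import dataclass, field

        @dataclass
        class DataCenter:
            name: str
            versions: dict = field(default_factory=dict)  # record -> committed version

            def try_accept(self, record, base_version):
                """Accept a write iff it is based on the record's current
                version (the optimistic concurrency check)."""
                if self.versions.get(record, 0) != base_version:
                    return False
                self.versions[record] = base_version + 1
                return True

        def commit(dcs, writes):
            """One message round: send every write to every data center and
            commit iff a fast quorum (ceil(3n/4)) accepts all of them."""
            fast_quorum = -(-3 * len(dcs) // 4)
            acks = sum(all(dc.try_accept(r, v) for r, v in writes) for dc in dcs)
            return acks >= fast_quorum

        dcs = [DataCenter(f"dc{i}") for i in range(5)]
        print(commit(dcs, [("x", 0), ("y", 0)]))  # True: all 5 accept in one round
        print(commit(dcs, [("x", 0)]))            # False: x already moved to version 1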

    Top-K Queries on Uncertain Data: On Score Distribution and Typical Answers

    Uncertain data arises in a number of domains, including data integration and sensor networks. Top-k queries that rank results according to some user-defined score are an important tool for exploring large uncertain data sets. As several recent papers have observed, the semantics of top-k queries on uncertain data can be ambiguous due to tradeoffs between reporting high-scoring tuples and tuples with a high probability of being in the resulting data set. In this paper, we demonstrate the need to present the score distribution of top-k vectors to allow the user to choose between results along this score-probability tradeoff. One option would be to display the complete distribution of all potential top-k tuple vectors, but this set is too large to compute. Instead, we propose to provide a number of typical vectors that effectively sample this distribution, and we propose efficient algorithms to compute these vectors. We also extend the semantics and algorithms to the scenario of score ties, which has not been dealt with in previous work in the area. Our work includes a systematic empirical study on both real and synthetic datasets.
    National Science Foundation (U.S.) (Grant IIS-0086057)
    National Science Foundation (U.S.) (Grant IIS-0325838)
    National Science Foundation (U.S.) (Grant IIS-0448124)
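
    As a hedged illustration of the "typical answers" idea (not the paper's algorithms, which avoid this kind of brute force), one can sample possible worlds of a tuple-independent uncertain table, take each world's top-k vector, and report the most frequent vectors as samples of the distribution. The table, probabilities, and sample count below are made up.

        import random
        from collections import Counter

        # (tuple_id, score, existence probability) -- made-up uncertain table
        TABLE = [("t1", 90, 0.4), ("t2", 80, 0.9), ("t3", 70, 0.8), ("t4", 60, 0.95)]

        def sample_topk_vector(k):
            """Draw one possible world and return its top-k vector of tuple ids."""
            world = [(tid, score) for tid, score, p in TABLE if random.random() < p]
            world.sort(key=lambda t: -t[1])
            return tuple(tid for tid, _ in world[:k])

        random.seed(0)
        counts = Counter(sample_topk_vector(2) for _ in range(10_000))
        for vec, n in counts.most_common(3):
            print(vec, f"{n / 10_000:.1%}")  # the most typical top-2 answers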

    No Bits Left Behind

    One of the key tenets of database system design is making efficient use of storage and memory resources. However, existing database system implementations are actually extremely wasteful of such resources; for example, most systems leave a great deal of empty space in tuples, index pages, and data pages, and spend many CPU cycles reading cold records from disk that are never used. In this paper, we identify a number of such sources of waste, and present a series of techniques that limit this waste (e.g., forcing better memory locality for hot data and using empty space in index pages to cache popular tuples) without substantially complicating interfaces or system design. We show that these techniques effectively reduce memory requirements for real scenarios from the Wikipedia database (by up to 17.8×) while increasing query performance (by up to 8×).
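
    One of the named techniques, caching popular tuples in the empty space of index pages, can be sketched as below. The page layout, byte sizes, and the absence of an eviction policy are invented for illustration and are not the paper's design.

        class IndexPage:
            """Interior index page whose slack space doubles as a tuple cache."""
            PAGE_BYTES = 4096
            KEY_BYTES = 16
            TUPLE_BYTES = 64

            def __init__(self, keys):
                self.keys = keys                   # routing keys (the real content)
                self.cache = {}                    # hot tuples parked in free space
                self.free = self.PAGE_BYTES - self.KEY_BYTES * len(keys)

            def maybe_cache(self, key, tup):
                if self.free >= self.TUPLE_BYTES:  # only ever use empty space
                    self.cache[key] = tup
                    self.free -= self.TUPLE_BYTES

        def lookup(page, key, heap):
            if key in page.cache:                  # hit: no heap page touched
                return page.cache[key]
            tup = heap[key]                        # miss: normal index-to-heap path
            page.maybe_cache(key, tup)             # park it for next time
            return tup

        heap = {17: ("alice", 3)}
        page = IndexPage(keys=[10, 20, 30])
        print(lookup(page, 17, heap))  # first access reads the heap, then caches
        print(lookup(page, 17, heap))  # second access is served from page slack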

    UPI: A Primary Index for Uncertain Databases

    Uncertain data management has received growing attention from industry and academia. Many efforts have been made to optimize uncertain databases, including the development of special index data structures. However, none of these efforts have explored primary (clustered) indexes for uncertain databases, despite the fact that clustering has the potential to offer substantial speedups for non-selective analytic queries on large uncertain databases. In this paper, we propose a new index called a UPI (Uncertain Primary Index) that clusters heap files according to uncertain attributes with both discrete and continuous uncertainty distributions. Because uncertain attributes may have several possible values, a UPI on an uncertain attribute duplicates tuple data once for each possible value. To prevent the size of the UPI from becoming unmanageable, its size is kept small by placing low-probability tuples in a special Cutoff Index that is consulted only when queries for low-probability values are run. We also propose several other optimizations, including techniques to improve secondary index performance and techniques to reduce maintenance costs and fragmentation by buffering changes to the table and writing updates in sequential batches. Finally, we develop cost models for UPIs to estimate query performance in various settings and to help automatically select the tuning parameters of a UPI. We have implemented a prototype UPI and experimented on two real datasets. Our results show that UPIs can significantly (up to two orders of magnitude) improve the performance of uncertain queries over both clustered and unclustered attributes. We also show that our buffering techniques mitigate table fragmentation and keep the maintenance cost as low as or even lower than using an unclustered heap file.
    National Science Foundation (U.S.) (Grant IIS-0448124)
    National Science Foundation (U.S.) (Grant IIS-0905553)
    National Science Foundation (U.S.) (Grant IIS-0916691)
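
    A minimal sketch of the layout the abstract describes, assuming a discrete uncertain attribute: each tuple is duplicated once per possible value and clustered by value, with alternatives below a probability cutoff diverted to a separate Cutoff Index that equality queries consult only when low-probability answers are actually requested. The cutoff threshold and data are illustrative.

        from collections import defaultdict

        CUTOFF = 0.1  # probability threshold (illustrative)

        def build_upi(rows):
            """rows: (tuple_id, {value: probability}) with a discrete uncertain
            attribute. Returns (upi, cutoff_index), each mapping value ->
            [(tuple_id, probability)], i.e. tuples clustered by possible value."""
            upi, cutoff = defaultdict(list), defaultdict(list)
            for tid, dist in rows:
                for value, p in dist.items():  # duplicate once per possible value
                    (upi if p >= CUTOFF else cutoff)[value].append((tid, p))
            return upi, cutoff

        upi, cutoff_index = build_upi([
            ("t1", {"red": 0.85, "blue": 0.15}),
            ("t2", {"red": 0.05, "green": 0.95}),
        ])

        def equality_query(value, min_prob=0.0):
            """Consult the Cutoff Index only when low-probability answers
            are requested."""
            hits = list(upi[value])
            if min_prob < CUTOFF:
                hits += cutoff_index[value]
            return [h for h in hits if h[1] >= min_prob]

        print(equality_query("red", min_prob=0.5))  # [('t1', 0.85)], cutoff skipped
        print(equality_query("red"))                # also returns ('t2', 0.05)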

    Partial replay of long-running applications

    Bugs in deployed software can be extremely difficult to track down. Invasive logging techniques, such as logging all non-deterministic inputs, can incur substantial runtime overheads. This paper shows how symbolic analysis can be used to re-create path-equivalent executions for very long-running programs such as databases and web servers. The goal is to help developers debug such long-running programs by allowing them to walk through an execution of the last few requests or transactions leading up to an error. The challenge is to provide this functionality without the high runtime overheads associated with traditional replay techniques based on input logging or memory snapshots. Our approach achieves this by recording a small amount of information about program execution, such as the direction of branches taken, and then using symbolic analysis to reconstruct the execution of the last few inputs processed by the application, as well as the state of memory before these inputs were executed. We implemented our technique in a new tool called bbr. In this paper, we show that it can be used to replay bugs in long-running single-threaded programs starting from the middle of an execution. We show that bbr incurs low recording overhead (10% on average) during program execution, which is much less than existing replay schemes. We also show that it can reproduce real bugs from web servers, database systems, and other common utilities.
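
    To make the recording idea concrete, here is a toy sketch (in Python, rather than the instrumented native programs bbr actually targets) of why a stream of branch directions suffices to steer a deterministic re-execution down the same path; the symbolic reconstruction of memory that bbr layers on top is not shown. The hook names and the example program are invented.

        branch_log = []

        def record_branch(cond):
            """Recording mode: log each branch direction as it is taken."""
            branch_log.append(bool(cond))
            return cond

        def classify(x, branch=record_branch):
            """The program under test; every branch funnels through the hook."""
            if branch(x > 10):
                return "big"
            if branch(x % 2 == 0):
                return "even"
            return "odd"

        print(classify(4))  # records [False, True], prints "even"

        # Replay mode: the recorded directions dictate the path taken; the
        # condition values computed from the (dummy) input are ignored.
        log = list(branch_log)
        def replay_branch(_cond):
            return log.pop(0)
        print(classify(0, branch=replay_branch))  # follows the log: "even"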

    Code In The Air: Simplifying Sensing and Coordination Tasks on Smartphones

    A growing class of smartphone applications is tasking applications: applications that run continuously, process data from sensors to determine the user's context (such as location) and activity, and optionally trigger certain actions when the right conditions occur. Many such tasking applications also involve coordination between multiple users or devices. Example tasking applications include location-based reminders, changing the ring mode of a phone automatically depending on location, notifying when friends are nearby, disabling WiFi in favor of cellular data when moving at more than a certain speed outdoors, automatically tracking and storing movement tracks when driving, and inferring the number of steps walked each day. Today, these applications are non-trivial to develop, although they are often trivial for end users to state. Additionally, simple implementations can consume excessive amounts of energy. This paper proposes Code in the Air (CITA), a system which simplifies the rapid development of tasking applications. It enables non-expert end users to easily express simple tasks on their phone, and more sophisticated developers to write code for complex tasks as purely server-side scripts. CITA provides a task execution framework to automatically distribute and coordinate tasks, energy-efficient modules to infer user activities and compose them, and a push communication service for mobile devices that overcomes some shortcomings of existing push services.
    National Science Foundation (U.S.) (Grant 0931550)
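
    CITA's actual scripting API is not given in the abstract, so the following is a purely hypothetical sketch of the shape such a tasking script might take: declare a predicate over sensed context, attach an action, and let the runtime decide where each piece runs and how devices coordinate. The Task class, context fields, and reminder example are all invented.

        from dataclasses import dataclass
        from typing import Callable

        @dataclass
        class Task:
            """Hypothetical tasking primitive: a predicate over sensed context
            plus an action to fire when it becomes true."""
            condition: Callable[[dict], bool]
            action: Callable[[dict], None]

            def on_context(self, ctx):
                if self.condition(ctx):
                    self.action(ctx)

        # A location-based reminder, one of the abstract's example tasks.
        reminder = Task(
            condition=lambda ctx: ctx["place"] == "grocery_store",
            action=lambda ctx: print(f"[{ctx['user']}] reminder: buy milk"),
        )

        # A real runtime would push sensor-derived context updates; we simulate two.
        reminder.on_context({"user": "bob", "place": "office"})
        reminder.on_context({"user": "bob", "place": "grocery_store"})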

    Using The Barton Libraries Dataset As An RDF benchmark

    This report describes the Barton Libraries RDF dataset and the Longwell query benchmark that we use for our recent VLDB paper on Scalable Semantic Web Data Management Using Vertical Partitioning.